Fixes #27950: [Datalake] JSON columns incorrectly typed as STRING for empty dict values #27951
Pull request overview
Fixes an ingestion bug in the Datalake connector where JSON-like columns (especially empty {} / [] values coming from single-object JSON files) were incorrectly inferred as STRING, and where parsing children could emit repeated TypeError debug logs.
Changes:
- Update column type inference to treat non-null object columns as candidates even when values are falsy containers, and avoid unnecessary `ast.literal_eval` for already-parsed `dict`/`list` values.
- Rework JSON children extraction to handle mixed parsed-`dict` and JSON-string values without `TypeError` noise.
- Add unit + fixture-based tests covering parsed objects, empty containers, and single-object JSON ingestion behavior.
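The inference change in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code in `datalake_utils.py`: the helper name `infer_value_type` and the plain-string type labels are assumptions made for the example. The point it shows is that an already-parsed `dict` or `list` is classified by `isinstance`, so an empty `{}` or `[]` is never skipped for being falsy, and only strings go through `ast.literal_eval`.

```python
import ast


def infer_value_type(value):
    """Return a coarse type label for one cell value (hypothetical helper)."""
    if value is None:
        return None  # nulls carry no type information

    # Already-parsed containers are classified directly. Using isinstance
    # instead of truthiness means empty {} and [] still count.
    if isinstance(value, dict):
        return "JSON"
    if isinstance(value, list):
        return "ARRAY"

    if isinstance(value, str):
        # Only strings need parsing; literal_eval is skipped for containers.
        try:
            parsed = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return "STRING"
        if isinstance(parsed, dict):
            return "JSON"
        if isinstance(parsed, list):
            return "ARRAY"

    return "STRING"
```

With this shape, `infer_value_type({})` and `infer_value_type("{}")` both yield `"JSON"`, while a plain string such as `"hello"` stays `"STRING"`.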
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| ingestion/src/metadata/utils/datalake/datalake_utils.py | Fixes type inference for empty dict/list values and makes get_children robust to parsed objects vs JSON strings. |
| ingestion/tests/unit/utils/test_datalake.py | Adds targeted tests for fetch_col_types/get_children and fixture-driven single-object JSON parsing. |
| ingestion/tests/unit/resources/datalake/dbt_manifest.json | Adds a representative single-object dbt manifest fixture with multiple empty-object fields. |
| ingestion/tests/unit/resources/datalake/dbt_catalog.json | Adds a representative single-object dbt catalog fixture with nested dicts and nulls. |
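As an illustration of what "robust to parsed objects vs JSON strings" could look like, here is a hedged sketch. The function `collect_json_objects` is hypothetical and is not the actual `get_children` signature; it only demonstrates branching on the value's type before parsing, so that `json.loads` is never called on a `dict` (which would raise `TypeError`).

```python
import json


def collect_json_objects(values):
    """Gather dicts from a mix of parsed dicts and JSON strings.

    Hypothetical helper for illustration only.
    """
    objects = []
    for value in values:
        if isinstance(value, dict):
            # Already parsed: use as-is, no json.loads call, no TypeError.
            objects.append(value)
        elif isinstance(value, str):
            try:
                parsed = json.loads(value)
            except ValueError:  # json.JSONDecodeError subclasses ValueError
                continue
            if isinstance(parsed, dict):
                objects.append(parsed)
        # Anything else (None, numbers, lists) is skipped quietly.
    return objects
```

Handling each input kind in its own branch keeps the failure path limited to genuinely malformed JSON strings.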
🟡 Playwright Results: all passed (21 flaky)
✅ 4064 passed · ❌ 0 failed · 🟡 21 flaky · ⏭️ 86 skipped
🟡 21 flaky test(s) (passed on retry)
How to debug locally:
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip # view trace
Code Review ✅ Approved
Replaces lexicographic type resolution with explicit precedence in datalake utils to correctly identify JSON and array columns. Added comprehensive unit tests to ensure accurate type detection for empty containers and mixed-type inputs.
Failed to cherry-pick changes to the 1.12.7 branch.
Failed to cherry-pick changes to the 1.13 branch.



Describe your changes:
Fixes #27950
Changes in OpenMetadata submodule (`datalake_utils.py`):
- Empty `dict`/`list` values are now typed as `JSON`/`ARRAY` instead of `STRING`
- Avoids the `ast.literal_eval` round-trip for already-parsed `dict`/`list` values
- `get_children` handles parsed dicts and JSON strings independently, so no more `TypeError` log spam

Tests added (`tests/unit/utils/test_datalake.py`):
- `fetch_col_types` and `get_children` with parsed objects, empty containers, mixed types
- `_read_json_object → _get_columns` pipeline

Type of change:
- Bug fix

Checklist:
- PR title follows `Fixes <issue-number>: <short explanation>`
Summary by Gitar
- Replaced lexicographic `max()` type resolution with a `_TYPE_PRECEDENCE` mapping in `fetch_col_types`.
- Prevents container types (`dict`, `list`) from being incorrectly downgraded to `STRING` in mixed-type columns.
- Added `TestFetchColTypesMixedTypes` to verify correct resolution for mixed `dict`/`str`, `list`/`str`, and numeric column types.
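To see why a lexicographic `max()` misbehaves here, consider the sketch below. The mapping `TYPE_PRECEDENCE`, its numeric ranks, and the helper `resolve_type` are illustrative assumptions; the real `_TYPE_PRECEDENCE` values in `fetch_col_types` are not shown in this PR text and may differ.

```python
# Hypothetical ranks for illustration; higher wins. Not the real mapping.
TYPE_PRECEDENCE = {
    "JSON": 3,
    "ARRAY": 2,
    "STRING": 1,
}


def resolve_type(candidates, default="STRING"):
    """Pick the winning type label by explicit rank, not string order."""
    candidates = list(candidates)
    if not candidates:
        return default
    return max(candidates, key=lambda name: TYPE_PRECEDENCE.get(name, 0))


# Plain max() compares strings alphabetically, so "STRING" beats "JSON"
# simply because "S" sorts after "J".
lexicographic_winner = max(["JSON", "STRING"])
precedence_winner = resolve_type(["JSON", "STRING"])
```

Here `lexicographic_winner` is `"STRING"` while `precedence_winner` is `"JSON"`, which is the downgrade the precedence mapping is meant to prevent.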